Concepts, Challenges, and Strategies to Address Missing Data
2025-11-20
simulate_birth_data <- function(n = 1000, beta_x = 0.5, beta_e = 1, sd_error = 2,
                                mcar_rate = 0.5, mar_logit_shift = 1.2,
                                mar_depend = c("E", "X")) {
  mar_depend <- match.arg(mar_depend)
  # E is a binary covariate that shifts both X and Y
  E <- rbinom(n, 1, 0.5)
  X <- 1.5 * E + rnorm(n)
  Y <- 2 + beta_x * X + beta_e * E + rnorm(n, sd = sd_error)
  # MCAR: every row has the same probability of losing X
  p_miss_mcar <- rep(mcar_rate, n)
  miss_mcar <- rbinom(n, 1, p_miss_mcar)
  X_mcar <- ifelse(miss_mcar == 1, NA_real_, X)
  # Missingness driver: with a positive mar_logit_shift, E = 0 (or low X)
  # raises the chance that X is missing
  mar_var <- if (mar_depend == "E") {
    1 - E
  } else {
    -as.numeric(scale(X))
  }
  # Choose the intercept so the average missingness rate equals mcar_rate
  alpha <- if (mar_logit_shift == 0) {
    qlogis(mcar_rate)
  } else {
    uniroot(
      f = function(a) mean(plogis(a + mar_logit_shift * mar_var)) - mcar_rate,
      interval = c(-15, 15)
    )$root
  }
  p_miss_mar <- plogis(alpha + mar_logit_shift * mar_var)
  miss_mar <- rbinom(n, 1, p_miss_mar)
  X_mar <- ifelse(miss_mar == 1, NA_real_, X)
  tibble::tibble(E, X, Y, X_mcar, miss_mcar, X_mar, miss_mar)
}
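
The intercept search inside the simulation function can be isolated as a small sketch. The helper name `calibrate_alpha`, the seed, and the 0.3 target are illustrative assumptions; only the `uniroot()` call and the `1 - E` coding mirror the function above.

```r
# Hypothetical helper: find the intercept `alpha` so that the average
# missingness probability plogis(alpha + shift * driver) equals `target`.
calibrate_alpha <- function(driver, shift, target) {
  if (shift == 0) {
    return(qlogis(target))  # no dependence on the driver: constant probability
  }
  uniroot(
    f = function(a) mean(plogis(a + shift * driver)) - target,
    interval = c(-15, 15)
  )$root
}

set.seed(1)                      # arbitrary seed, for reproducibility only
E <- rbinom(5000, 1, 0.5)
driver <- 1 - E                  # same coding as mar_depend = "E"
alpha <- calibrate_alpha(driver, shift = 1.2, target = 0.3)
p_miss <- plogis(alpha + 1.2 * driver)

mean(p_miss)                     # close to the 0.3 target
tapply(p_miss, E, mean)          # E = 0 has the higher missingness probability
```

Solving for the intercept keeps the overall missingness rate fixed while the shift controls how strongly missingness depends on the driver, so scenarios with different mechanisms stay comparable.
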
X → Birthweight Y
E affects both nutrition and birthweight
Call:
lm(formula = Y ~ X + E, data = sim_data)
Residuals:
Min 1Q Median 3Q Max
-7.6783 -1.3513 -0.0298 1.3986 7.7126
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.98314 0.04027 49.24 <2e-16 ***
X 0.50024 0.02878 17.38 <2e-16 ***
E 1.00465 0.07198 13.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.022 on 4997 degrees of freedom
Multiple R-squared: 0.2006, Adjusted R-squared: 0.2003
F-statistic: 627.1 on 2 and 4997 DF, p-value: < 2.2e-16
X for a subset of mothers.
# Ensure base data exists (5000 rows defined earlier)
if (!exists("sim_data")) {
  sim_data <- simulate_birth_data(
    n = 5000,
    beta_x = beta_x_true,
    beta_e = beta_e_true,
    sd_error = sd_error_true
  )
}
# Full data model (gold standard) on the complete data
full_fit <- lm(Y ~ X + E, data = sim_data)
# Randomly mark 50% of X as missing (MCAR) for this demo
mcar_data <- sim_data |>
  mutate(
    drop_flag = rbinom(n(), 1, 0.5),
    X_mcar = if_else(drop_flag == 1, NA_real_, X)
  )
# Only keep rows where X_mcar is observed (listwise deletion)
mcar_cc <- mcar_data |> filter(!is.na(X_mcar))
mcar_fit <- lm(Y ~ X_mcar + E, data = mcar_cc)
nrow(mcar_cc)
[1] 2522
Call:
lm(formula = Y ~ X_mcar + E, data = mcar_cc)
Residuals:
Min 1Q Median 3Q Max
-7.7022 -1.3534 0.0086 1.3901 7.7041
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.97448 0.05698 34.65 <2e-16 ***
X_mcar 0.47749 0.04102 11.64 <2e-16 ***
E 1.02528 0.10239 10.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.043 on 2519 degrees of freedom
Multiple R-squared: 0.1916, Adjusted R-squared: 0.1909
F-statistic: 298.5 on 2 and 2519 DF, p-value: < 2.2e-16
Call:
lm(formula = Y ~ X_mcar + E, data = mcar_cc)
Coefficients:
(Intercept) X_mcar E
1.9745 0.4775 1.0253
# A tibble: 2 × 3
method estimate se
<chr> <dbl> <dbl>
1 Full data (truth) 0.500 0.0288
2 MCAR listwise 0.477 0.0410
MCAR: estimates center on truth; uncertainty widens as more X is missing.
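
A comparison table like the one above can be assembled directly from the two fitted models. This is a minimal base-R sketch, not the report's own code: the helper `extract_x`, the seed, and the inlined simulation are assumptions, with coefficients set to the true values used in the simulation (0.5 for X, 1 for E, error SD 2).

```r
set.seed(2025)  # arbitrary seed, for reproducibility only
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + 1 * E + rnorm(n, sd = 2)

# MCAR: every row has the same 50% chance of losing X
X_mcar <- ifelse(rbinom(n, 1, 0.5) == 1, NA_real_, X)

full_fit <- lm(Y ~ X + E)
mcar_fit <- lm(Y ~ X_mcar + E)   # lm() drops rows with NA by default

# Hypothetical helper: pull one coefficient and its standard error from a fit
extract_x <- function(fit, term) {
  coefs <- summary(fit)$coefficients
  c(estimate = coefs[term, "Estimate"], se = coefs[term, "Std. Error"])
}

comparison <- data.frame(
  method = c("Full data", "MCAR listwise"),
  rbind(extract_x(full_fit, "X"), extract_x(mcar_fit, "X_mcar"))
)
comparison
```

With about half the rows removed, the listwise standard error should be larger than the full-data one while the point estimate stays close to 0.5.
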
We will look at:
x missing)

| Scenario | N | Y mean | Y sd | Y bias | X mean | X sd | X bias | E mean | E sd | E bias |
|---|---|---|---|---|---|---|---|---|---|---|
| Full data | 20000 | 2.87 | 2.24 | 0 | 0.73 | 1.25 | 0.00 | 0.49 | 0.5 | 0 |
| 10% missing, E-driven | 17898 | 2.87 | 2.24 | 0 | 0.81 | 1.25 | 0.08 | 0.49 | 0.5 | 0 |
| 30% missing, E-driven | 14020 | 2.87 | 2.24 | 0 | 0.99 | 1.23 | 0.26 | 0.49 | 0.5 | 0 |
| 60% missing, E-driven | 8121 | 2.87 | 2.24 | 0 | 1.29 | 1.12 | 0.56 | 0.49 | 0.5 | 0 |

Y, X, and p(x missing); these models are mis-specified without E

| Scenario | N | β_X | SE(β_X) |
|---|---|---|---|
| Full data | 20000 | 0.734 | 0.012 |
| 10% missing, E-driven | 17898 | 0.734 | 0.012 |
| 30% missing, E-driven | 14020 | 0.710 | 0.014 |
| 60% missing, E-driven | 8121 | 0.626 | 0.020 |

| Scenario | N | β_X | SE(β_X) | β_E | SE(β_E) |
|---|---|---|---|---|---|
| Full data | 20000 | 0.497 | 0.014 | 0.988 | 0.035 |
| 10% missing, E-driven | 17898 | 0.496 | 0.015 | 1.001 | 0.037 |
| 30% missing, E-driven | 14020 | 0.500 | 0.017 | 0.945 | 0.044 |
| 60% missing, E-driven | 8121 | 0.489 | 0.022 | 1.015 | 0.073 |

E), listwise deletion analyses stay near the truth.

| Scenario | Missing | N listwise | β_X listwise | SE listwise | N | β_X | SE(β_X) | β_X bias | X mean | X sd |
|---|---|---|---|---|---|---|---|---|---|---|
| Full data | 0% | 20000 | 0.497 | 0.014 | 20000 | 0.497 | 0.014 | -0.003 | 0.73 | 1.25 |
| 10% missing, E-driven | 11% | 17898 | 0.496 | 0.015 | 20000 | 0.470 | 0.015 | -0.030 | 0.81 | 1.18 |
| 30% missing, E-driven | 30% | 14020 | 0.500 | 0.017 | 20000 | 0.410 | 0.016 | -0.090 | 0.99 | 1.03 |
| 60% missing, E-driven | 59% | 8121 | 0.489 | 0.022 | 20000 | 0.407 | 0.021 | -0.093 | 1.29 | 0.71 |

miss_ind = 1 when \(X\) was imputed, and fit \(Y \sim X + E + miss\_ind\).

| Scenario | Missing | N listwise | β_X listwise | SE listwise | N | β_X | SE(β_X) | β_X bias | X mean | X sd |
|---|---|---|---|---|---|---|---|---|---|---|
| Full data | 0% | 20000 | 0.497 | 0.014 | 20000 | 0.497 | 0.014 | -0.003 | 0.73 | 1.25 |
| 10% missing, E-driven | 11% | 17898 | 0.496 | 0.015 | 20000 | 0.492 | 0.015 | -0.008 | 0.81 | 1.18 |
| 30% missing, E-driven | 30% | 14020 | 0.500 | 0.017 | 20000 | 0.459 | 0.016 | -0.041 | 0.99 | 1.03 |
| 60% missing, E-driven | 59% | 8121 | 0.489 | 0.022 | 20000 | 0.409 | 0.021 | -0.091 | 1.29 | 0.71 |
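
The missing-indicator fit described above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the table: the fill-in value (the observed mean of X), the missingness rates (0.5 for E = 0, 0.1 for E = 1), and the seed are all chosen here for demonstration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n <- 20000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + 1 * E + rnorm(n, sd = 2)

# E-driven missingness: E = 0 rows lose X more often (illustrative rates)
p_miss <- ifelse(E == 0, 0.5, 0.1)
miss_ind <- rbinom(n, 1, p_miss)

# Assumed fill-in value: the observed mean of X
X_filled <- ifelse(miss_ind == 1, mean(X[miss_ind == 0]), X)

# Missing-indicator model: filled-in X, E, and a flag for imputed rows
ind_fit <- lm(Y ~ X_filled + E + miss_ind)

# Listwise deletion for comparison: only rows where X was observed
cc_fit <- lm(Y ~ X + E, subset = miss_ind == 0)

round(c(indicator = coef(ind_fit)[["X_filled"]],
        listwise  = coef(cc_fit)[["X"]]), 3)
```

Printing the two coefficients side by side mirrors the β_X and β_X listwise columns of the table.
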

| Scenario | Missing | N listwise | β_X listwise | SE listwise | N | β_X | SE(β_X) | β_X bias | X mean | X sd |
|---|---|---|---|---|---|---|---|---|---|---|
| Full data | 0% | 20000 | 0.497 | 0.014 | 20000 | 0.497 | 0.014 | -0.003 | 0.73 | 1.25 |
| 10% missing, E-driven | 11% | 17898 | 0.496 | 0.015 | 20000 | 0.392 | 0.014 | -0.108 | 0.81 | 1.25 |
| 30% missing, E-driven | 30% | 14020 | 0.500 | 0.017 | 20000 | 0.265 | 0.013 | -0.235 | 0.99 | 1.23 |
| 60% missing, E-driven | 59% | 8121 | 0.489 | 0.022 | 20000 | 0.147 | 0.013 | -0.353 | 1.28 | 1.13 |

| Scenario | Missing | N listwise | β_X listwise | SE listwise | N | β_X | SE(β_X) | β_X bias | X mean | X sd |
|---|---|---|---|---|---|---|---|---|---|---|
| Full data | 0% | 20000 | 0.497 | 0.014 | 20000 | 0.497 | 0.014 | -0.003 | 0.73 | 1.25 |
| 10% missing, E-driven | 11% | 17898 | 0.496 | 0.015 | 20000 | 0.496 | 0.015 | -0.004 | 0.73 | 1.21 |
| 30% missing, E-driven | 30% | 14020 | 0.500 | 0.017 | 20000 | 0.500 | 0.017 | 0.000 | 0.73 | 1.13 |
| 60% missing, E-driven | 59% | 8121 | 0.489 | 0.022 | 20000 | 0.489 | 0.023 | -0.011 | 0.74 | 0.98 |

| Scenario | Missing | N listwise | β_X listwise | SE listwise | β_X (MI) | SE (MI) | β_X bias (MI) |
|---|---|---|---|---|---|---|---|
| Full data | 0% | 20000 | 0.497 | 0.014 | 0.497 | 0.014 | -0.003 |
| 10% missing, E-driven | 11% | 17898 | 0.496 | 0.015 | 0.502 | 0.015 | 0.002 |
| 30% missing, E-driven | 30% | 14020 | 0.500 | 0.017 | 0.495 | 0.017 | -0.005 |
| 60% missing, E-driven | 59% | 8121 | 0.489 | 0.022 | 0.462 | 0.027 | -0.038 |
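
The MI columns combine estimates across several imputed datasets. The sketch below shows one way to do that by hand with Rubin's rules in base R. It is a simplified approximation and not necessarily the procedure behind the table: the bootstrap step stands in for drawing imputation-model parameters, and `m = 20`, the missingness rates, and the seed are arbitrary choices.

```r
set.seed(7)  # arbitrary seed, for reproducibility only
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + 1 * E + rnorm(n, sd = 2)
dat <- data.frame(Y, X, E)
dat$X[rbinom(n, 1, ifelse(E == 0, 0.5, 0.1)) == 1] <- NA  # E-driven missingness

m <- 20                                  # number of imputed datasets (assumed)
obs <- dat[!is.na(dat$X), ]
est <- numeric(m)
var_within <- numeric(m)

for (i in seq_len(m)) {
  # Bootstrap the observed rows so imputation-model uncertainty is propagated
  boot <- obs[sample(nrow(obs), replace = TRUE), ]
  imp_fit <- lm(X ~ Y + E, data = boot)  # imputation model includes Y

  # Fill missing X with a prediction plus residual noise
  completed <- dat
  miss <- is.na(completed$X)
  completed$X[miss] <- predict(imp_fit, newdata = completed[miss, ]) +
    rnorm(sum(miss), sd = summary(imp_fit)$sigma)

  # Analysis model on the completed data
  fit <- lm(Y ~ X + E, data = completed)
  est[i] <- coef(fit)[["X"]]
  var_within[i] <- vcov(fit)["X", "X"]
}

# Rubin's rules: pooled estimate, within and between variance, total variance
q_bar <- mean(est)
W <- mean(var_within)
B <- var(est)
total_var <- W + (1 + 1 / m) * B
c(estimate = q_bar, se = sqrt(total_var))
```

The between-imputation term `B` is what lets the pooled standard error reflect uncertainty about the missing values. A single imputed dataset would report only `W`.
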

Y, analysts often hesitate to use Y to impute X

Y out you replicate the listwise bias because the imputations can't see the selection mechanism.

| Method | $\hat{\beta}_X$ | SE | Bias |
|---|---|---|---|
| Full data (truth) | 0.478 | 0.025 | -0.022 |
| Listwise deletion | 0.408 | 0.032 | -0.092 |
| Multiple imputation | 0.510 | 0.034 | 0.010 |

X

| Scenario | N | Y mean | Y sd | Y bias | X mean | X sd | X bias | E mean | E sd | E bias |
|---|---|---|---|---|---|---|---|---|---|---|
| Full data | 20000 | 2.87 | 2.24 | 0 | 0.73 | 1.25 | 0.00 | 0.49 | 0.5 | 0 |
| 10% missing, X-driven | 18018 | 2.87 | 2.24 | 0 | 0.96 | 1.09 | 0.23 | 0.49 | 0.5 | 0 |
| 30% missing, X-driven | 13971 | 2.87 | 2.24 | 0 | 1.33 | 0.91 | 0.60 | 0.49 | 0.5 | 0 |
| 60% missing, X-driven | 8000 | 2.87 | 2.24 | 0 | 1.90 | 0.74 | 1.17 | 0.49 | 0.5 | 0 |

| Scenario | N | β_X | SE(β_X) |
|---|---|---|---|
| Full data | 20000 | 0.734 | 0.012 |
| 10% missing, X-driven | 18018 | 0.745 | 0.014 |
| 30% missing, X-driven | 13971 | 0.744 | 0.019 |
| 60% missing, X-driven | 8000 | 0.717 | 0.031 |

| Scenario | N | β_X | SE(β_X) | β_E | SE(β_E) |
|---|---|---|---|---|---|
| Full data | 20000 | 0.497 | 0.014 | 0.988 | 0.035 |
| 10% missing, X-driven | 18018 | 0.493 | 0.017 | 0.985 | 0.036 |
| 30% missing, X-driven | 13971 | 0.501 | 0.021 | 0.998 | 0.040 |
| 60% missing, X-driven | 8000 | 0.560 | 0.032 | 0.886 | 0.059 |

| Scenario | Missing | N listwise | β_X listwise | SE listwise | β_X (MI) | SE (MI) | β_X bias (MI) |
|---|---|---|---|---|---|---|---|
| Full data | 0% | 20000 | 0.497 | 0.014 | 0.497 | 0.014 | -0.003 |
| 10% missing, X-driven | 10% | 18018 | 0.493 | 0.017 | 0.494 | 0.017 | -0.006 |
| 30% missing, X-driven | 30% | 13971 | 0.501 | 0.021 | 0.502 | 0.023 | 0.002 |
| 60% missing, X-driven | 60% | 8000 | 0.560 | 0.032 | 0.553 | 0.028 | 0.053 |

| Sample | Mean X | SD X | Mean educ | Mean insurance | Mean Y | SD Y |
|---|---|---|---|---|---|---|
| Full data | 0.00 | 1.00 | 0.00 | 0.00 | 3.03 | 2.10 |
| Listwise | 0.28 | 0.97 | 0.33 | 0.29 | 3.45 | 2.04 |
| MI (avg) | 0.13 | 0.98 | 0.00 | 0.00 | 3.03 | 2.10 |

| Method | β_X | SE | Bias |
|---|---|---|---|
| Omniscient (X + hidden drivers) | 0.484 | 0.052 | -0.016 |
| Observed, no missing | 0.563 | 0.050 | 0.063 |
| Listwise deletion | 0.564 | 0.069 | 0.064 |
| Multiple imputation | 0.585 | 0.067 | 0.085 |